A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions
Abstract
We present a dataset and method for improving the translation of noisy image captions created by users of Wikimedia Commons. The dataset is multilingual but non-parallel, and is several orders of magnitude larger than existing parallel data for multimodal machine translation. Our retrieval-based method pivots on similar images and uses the associated captions in the target language to rerank translation outputs. This method requires only small amounts of parallel captions to find the optimal ensemble of retrieval features based on textual and visual similarity. Furthermore, our method is compatible with any machine translation system and allows new data to be integrated quickly without retraining the translation system. Tests on three different datasets showed that the size and diversity of the data are crucial for the performance of our method. On the introduced dataset we observe consistent improvements of up to 5 BLEU points and 3 points in character F-score over strong neural MT baselines for three different language pairs.
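The reranking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the token-level F1 overlap, the single interpolation weight, and the assumption that MT scores are already normalized to the range 0 to 1 are all hypothetical choices made for illustration. The abstract describes tuning an ensemble of textual and visual retrieval features on a small amount of parallel captions, which this sketch replaces with one fixed weight.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length image feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def token_overlap(hypothesis, caption):
    """Token-level F1 between a hypothesis and one retrieved caption."""
    hyp = Counter(hypothesis.lower().split())
    cap = Counter(caption.lower().split())
    common = sum((hyp & cap).values())  # multiset intersection
    if common == 0:
        return 0.0
    precision = common / sum(hyp.values())
    recall = common / sum(cap.values())
    return 2 * precision * recall / (precision + recall)


def rerank(nbest, query_image, collection, k=2, weight=0.5):
    """Rerank (hypothesis, mt_score) pairs with captions of similar images.

    collection: list of (image_vector, target_language_caption) pairs.
    weight: hypothetical interpolation weight between the MT score
            (assumed to lie in [0, 1]) and the retrieval score.
    """
    # Pivot on the k images most similar to the query image.
    neighbours = sorted(
        collection, key=lambda item: cosine(query_image, item[0]), reverse=True
    )[:k]
    captions = [caption for _, caption in neighbours]

    def combined(entry):
        hypothesis, mt_score = entry
        # Score a hypothesis by its best match among the retrieved captions.
        retrieval = max((token_overlap(hypothesis, c) for c in captions), default=0.0)
        return (1 - weight) * mt_score + weight * retrieval

    return sorted(nbest, key=combined, reverse=True)


if __name__ == "__main__":
    # Toy data: two-dimensional "image features" and German captions.
    collection = [
        ([1.0, 0.0], "eine Kirche in Heidelberg"),
        ([0.9, 0.1], "die Kirche am Marktplatz"),
        ([0.0, 1.0], "ein Hund am Strand"),
    ]
    nbest = [
        ("eine Gemeinde in Heidelberg", 0.6),
        ("eine Kirche in Heidelberg", 0.5),
    ]
    best, _ = rerank(nbest, [1.0, 0.05], collection)[0]
    print(best)
```

Because the reranker only consumes an n-best list and a caption collection, new captions can be appended to `collection` without touching the translation system, which mirrors the property claimed in the abstract.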
Similar resources
Extractive and Abstractive Caption Generation Model for News Images
This paper provides a model for automatically generating captions for news images, which is used to support the development of news media management and many multimedia applications. In the existing method, the captions for news images are written manually by reading the text content. The caption generation task thus requires human involvement and is a time-consuming process. The proposed sys...
Multimodal Named Entity Recognition for Short Social Media Posts
We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we crea...
Multimodal Image Retrieval over a Large Database
We introduce a new multimodal retrieval technique which combines query reformulation and visual image reranking in order to deal with results sparsity and imprecision, respectively. Textual queries are reformulated using Wikipedia knowledge and results are then reordered using a k-NN based reranking method. We compare textual and multimodal retrieval and show that introducing visual reranking r...
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set
This paper presents an augmentation of the MSCOCO dataset in which speech is added to image and text. Speech captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600h) paired with images. Disfluencies and speed perturbation are added to the signal to make it sound more natural. Each speech signal (WAV) is paired with a JSON file containing exact ...
STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset
In recent years, automatic generation of image descriptions (captions), that is, image captioning, has attracted a great deal of attention. In this paper, we particularly consider generating Japanese captions for images. Since most available caption datasets have been constructed for the English language, there are few datasets for Japanese. To tackle this problem, we construct a large-scale Japane...
Publication date: 2018